Copyright © 1982 American Telephone and Telegraph Company 

The Bell System Technical Journal 

Vol. 61, No. 9, November 1982 

Printed in U.S.A. 



Database Systems: 



Database Administration System — 
Architecture and Design Issues 

By C. C. WANG and C. P. HUANG 

(Manuscript received September 22, 1982) 

The Database Administration System (DBAS) is a software system
designed for the Bell Operating Companies to administer several
remote, on-line, call-processing-related databases. These remote
databases include, for example, the Billing Validation Application
files associated with mechanized calling card service, and support
for the Automatic Intercept Centers. Briefly, DBAS accepts
service-order inputs and forwards them to other databases. DBAS
serves as a buffer between the high-speed, real-time-sensitive
billing validation applications and low-speed, nonuniform,
service-order inputs. DBAS also provides an on-line database to
support various administrative functions for the Bell Operating
Companies. The major challenge to the DBAS design lies in the size
of the database (up to 12 million telephone station records) and its
update throughput (up to 100,000 random updates per 10-hour day).

I. INTRODUCTION 

The Data Base Administration System (DBAS) is a PDP 11/70
computer system under the control of a real-time UNIX* operating
system designed for the Bell Operating Companies (BOCs) to administer
several remote, on-line, call-processing-related databases. These
remote databases, among others, include the billing validation
application (BVA) files located at different network control points
for the purpose of providing mechanized calling card service.¹† Briefly,
DBAS accepts service-order inputs and forwards them to the BVA
databases. Because the BVA databases are queried under real-time
constraints for processing telephone calls, direct access to them by
the BOCs would degrade the performance of the BVAs. DBAS serves as a
buffer between the high-speed, real-time-conscious BVAs and the
low-speed, nonuniform, service-order inputs. An on-line database at
the DBAS site is introduced to further relieve the BVAs of most
associated administrative functions: for each telephone station with
data located at a BVA, a DBAS on-line database contains a superset of
the data about that station. This superset includes what is needed
for providing mechanized calling card service and much more indirectly
related data needed by the BOCs for administrative purposes.

* UNIX is a trademark of Bell Laboratories.

† Other remote databases administered by the DBAS are not-in-service
telephone number data located at the automatic intercept centers and
originating (telephone) station treatment data located at the traffic
service position systems.

The major challenge of the dbas design lies in the size of the 
database and its throughput update volume. A large dbas database 
consists of up to 12 million telephone station records and has the 
capacity to process up to 100,000 random updates per 10-hour day. 
These figures are equivalent to (i) on-line secondary storage size of 
close to 1 billion bytes and (ii) a limitation to no more than nine disk 
accesses for an average random record update. 
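
The paper does not show how the nine-access ceiling follows from the
quoted figures. The sketch below is only a back-of-envelope reading of
them, assuming updates are spread evenly over the 10-hour day and disk
accesses are issued one after another. The function names are
illustrative and do not come from DBAS.

```c
#include <assert.h>

/* Figures quoted in the text. */
enum {
    RECORDS          = 12000000, /* telephone station records          */
    UPDATES_PER_DAY  = 100000,   /* random updates per working day     */
    HOURS_PER_DAY    = 10,       /* length of the working day          */
    ACCESSES_PER_UPD = 9         /* stated ceiling on accesses/update  */
};

/* Milliseconds available per update if updates arrive evenly
 * (an assumption, not a statement from the paper). */
long ms_per_update(void)
{
    long day_ms = (long)HOURS_PER_DAY * 3600L * 1000L;
    return day_ms / UPDATES_PER_DAY;
}

/* Time budget per disk access when an update uses the full ceiling. */
long ms_per_access(void)
{
    return ms_per_update() / ACCESSES_PER_UPD;
}

/* Average on-line bytes per record for a store of total_bytes. */
long bytes_per_record(long long total_bytes)
{
    assert(total_bytes > 0);
    return (long)(total_bytes / RECORDS);
}
```

Under these assumptions each update has 360 ms, which leaves 40 ms per
disk access at nine accesses per update, and 1 billion bytes spread over
12 million records averages a little over 80 bytes per record.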

Crash recovery is also very important in the dbas design. Note that 
the dbas database itself is not part of the switching system for call 
processing, whereas the bva is. A duplex design of the dbas database 
to ensure its high degree of availability would be too expensive because 
no calls are missed when the dbas is down. On the other hand, the 
dbas database must be available most of the time in spite of system 
failures because most bocs plan to operate their dbass on a six-days- 
per-week schedule. Since initial loading or reloading of a dbas database 
may take from two to five days, the integrity of the database must be 
maintained at all times so that recovery from a system failure rarely 
requires reloading the dbas database. 

Given the size of the dbas database, no existing general-purpose, 
minicomputer-based database management system (dbms) satisfies 
the dbas update throughput requirement. Clearly, the dbas applica- 
tion is a special-purpose one. In planning the dbas architecture, the 
easy way is to decide what should be included and what should be 
removed from a typical dbms to meet the dbas application. For 
example, multiple views of the database are supported to provide data 
independence among different application programs (aps) accessing 
the database. These aps include clerk input, administrative queries, 
order processing, order transmitting (to bvas), and audit (between bva 
and dbas databases). Multiple views allow the flexibility of late binding 
time among these aps. The interactions among these aps are minimized 
so that they can be programmed with ease by different people at the 
same time. Transaction processing is not supported because a
"transaction commit" usually requires more disk accesses per record
update and this hinders the objective of pushing the high volume of
updates through the database. However, a database checkpoint
scheme is implemented to facilitate crash recovery. An existing
database-management package, UNITY,¹ was adopted for high-level query
processing at an early stage of the project so that the available human
resources could be directed at designing efficient lower-level access
modules. The lower-level modules are directly responsible for meeting
the throughput objective. Table I lists the major features of DBAS.

Table I: Characteristics of DBAS

Supporting systems       PDP 11/70
                         C language
                         Real-time UNIX operating system
                         UNITY database-management system

Data model               Relational
                         Application programs in C language
                         Access database through function calls

Features                 Multiple users
                         Multiple user views
                         Multiple reads and writes
                         Extendible hashing file access
                         Variable length records
                         Secondary storage management
                         Database checkpoint
                         Hierarchical locking
                         Deadlock prevention
                         Secondary key retrieval
                         Buffer cache
                         Shared segment
                         Separate read-only and writable disks

Simplicity in coding     Message
and debugging            User space code
                         Synchronous, physical I/O

Capacity and             6 RP-06 disks
performance              12 million records
                         3 disk accesses per random record retrieval
                         Average 4 to 5 disk accesses per order update
                         10,000 order updates per hour
                         100,000 records per hour at initial load

This paper does not cover any of the aps. The overall architecture 
is described next. Design issues and their solutions by various com- 
ponents of the lower-level access modules are detailed in the remainder 
of the paper. 

II. ARCHITECTURE 

From the dbas database viewpoint, aps fall into one of two cate- 
gories: (i) order processing and (ii) administrative report generating 
and query processing. Order processing is the primary objective of the 
dbas in supporting mechanized calling card service. Report generating 
and query processing serve as the only interfaces between the machine 
and the boc administrators. Order processing, which runs all day in
the background, is disk I/O intensive. Report and query programs, run
in the foreground, are not as complex as the queries that would appear
in a general-purpose DBMS. Response time is important in most
database designs. However, the throughput rate is the main concern in
the DBAS design.

Fig. 1: DBAS architecture.

The block diagram in Fig. 1 reflects the above perception of the aps. 
All effort is put in the design of a set of highly efficient lower-level 
access modules. They are made not only directly accessible to the 
order processing programs but also to other aps. A random record 
retrieval and its subsequent update take, on the average, from four to 
five disk accesses. For high-level processing, a UNITY interface mod- 
ule is planned. The UNITY interface module is to retrieve data from 
the dbas database via the lower-level access modules and convert 
them to the relational format required by the UNITY command 
modules. The relational operators available in the UNITY command 
modules are then used for high-level query processing. 

2.1 Data models

Telephone stations, their associated equipment, and services are the
main entities of the DBAS database. There are conceptually two types
of relations: the Billing Number Group (BNG) relation type and the
Billing Number Record (BNR) relation type.² These two types are
hierarchically related: for each tuple in the BNG type relation there
is an instance of the BNR type relation (containing all BNR tuples for
that BNG). The tuples of the BNG type and BNR type relations are
called BNG records and BNR records, respectively.

A BNR record has the 10-digit telephone station number as its key.
The first six digits of a 10-digit telephone number identify a unique
BNG record and are also the key of that BNG record. The tuples of a
BNR relation represent all active telephone stations with identical
six-digit prefixes in their telephone numbers. The lower-level access
module does not provide high-level data-manipulation language
operations. The basic database access functions provided at the tuple
level are (i) retrieve, (ii) store, (iii) replace, and (iv) delete a
tuple. At the relational level, a complete relation of either the BNG
or the BNR type can be retrieved. A complete BNR relation can also be
stored in or deleted from the database in a single request. The
lower-level access module supports a restricted form of predefined
views under which an AP may access the database.
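
The key relationship described above can be sketched in C as follows.
This is a minimal illustration, assuming keys are held as digit strings;
the function and macro names are hypothetical and are not taken from the
DBAS library.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define BNR_KEY_LEN 10  /* full telephone station number          */
#define BNG_KEY_LEN 6   /* six-digit prefix shared by one BNG     */

/* Copy the six-digit BNG key out of a ten-digit BNR key.
 * Returns 0 on success, -1 if bnr_key is not exactly ten digits.
 * bng_key must have room for BNG_KEY_LEN + 1 bytes. */
int bng_key_of(const char *bnr_key, char *bng_key)
{
    size_t i;

    assert(bnr_key != NULL && bng_key != NULL);
    if (strlen(bnr_key) != BNR_KEY_LEN)
        return -1;
    for (i = 0; i < BNR_KEY_LEN; i++)
        if (!isdigit((unsigned char)bnr_key[i]))
            return -1;
    memcpy(bng_key, bnr_key, BNG_KEY_LEN);
    bng_key[BNG_KEY_LEN] = '\0';
    return 0;
}

/* Two stations belong to the same BNR relation when their
 * six-digit prefixes match. */
int same_bng(const char *a, const char *b)
{
    return strncmp(a, b, BNG_KEY_LEN) == 0;
}
```

For example, the station number 2015551234 belongs to the BNG whose key
is 201555, together with every other active station sharing that prefix.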

2.2 Message and processes 

Simplicity in design is the key to the success of a project. We have
striven not to duplicate any functions already supported by the UNIX
operating system unless the throughput objective is at stake. In the
area of interface between an AP and a database process, the following
constraints are observed to achieve simplicity: (i) each AP has at most
one outstanding database request and waits while the request is being
serviced, (ii) a database process services one request at a time and
uses neither multitasking nor asynchronous I/O techniques, and (iii)
database requests and replies are communicated between an AP and a
database process using messages. However, multiple APs can issue
database requests to a database process at the same time. They are
served in FIFO order. Even though messages are expensive in terms
of machine instructions, their use minimizes the asynchronous control
problem in dealing with multiple database requests from different
APs. The handicap of using messages is minimized by passing almost
all data among database-related processes through the shared
segments.
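
The three constraints above can be modelled with a small FIFO queue.
The sketch below is a single-process simulation only: DBAS itself uses
UNIX messages between separate processes, and the structure names, the
MAX_APS limit, and the opcode field are assumptions made for
illustration.

```c
#include <assert.h>

#define MAX_APS 8   /* hypothetical limit on concurrent APs */

struct request { int ap_id; int opcode; };

struct req_queue {
    struct request slot[MAX_APS]; /* one slot per possible waiting AP */
    int head, count;
    int pending[MAX_APS];         /* nonzero while an AP awaits reply */
};

void rq_init(struct req_queue *q)
{
    int i;
    q->head = q->count = 0;
    for (i = 0; i < MAX_APS; i++)
        q->pending[i] = 0;
}

/* AP side: post a request. Fails with -1 if this AP already has one
 * outstanding, which enforces constraint (i). */
int rq_post(struct req_queue *q, int ap_id, int opcode)
{
    int tail;

    assert(ap_id >= 0 && ap_id < MAX_APS);
    if (q->pending[ap_id])
        return -1;
    tail = (q->head + q->count) % MAX_APS;
    q->slot[tail].ap_id = ap_id;
    q->slot[tail].opcode = opcode;
    q->count++;
    q->pending[ap_id] = 1;
    return 0;
}

/* Database-process side: take the oldest request (FIFO order).
 * Returns -1 when the queue is empty. */
int rq_next(struct req_queue *q, struct request *out)
{
    if (q->count == 0)
        return -1;
    *out = q->slot[q->head];
    q->head = (q->head + 1) % MAX_APS;
    q->count--;
    return 0;
}

/* Database-process side: reply, which lets the AP post again. */
void rq_reply(struct req_queue *q, int ap_id)
{
    assert(ap_id >= 0 && ap_id < MAX_APS);
    q->pending[ap_id] = 0;
}
```

Because an AP stays marked as pending until it is answered, the queue
can never hold more than MAX_APS entries, so no overflow handling is
needed in this model.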

The DBAS database-management functions are partitioned into more
than one process because (i) they are too big to fit into the address
space of one process and (ii) system performance would suffer if both
complex low-frequency and simple high-frequency types of database
requests from multiple APs were all served by a single database process.
The database process would clearly be the bottleneck and would not
be able to take full advantage of the multiprogramming services offered
by the UNIX operating system. The DBAS database processes (Fig. 2)
consist of a single Database Manager (DBM) and several identical
Access Task Processes (ATPs). The DBM assumes all work that is best
suited to a single, centralized process. For example, since the
free page-address stack is shared and accessed by all database
processes for the purpose of allocation and deallocation of disk pages,
the jobs of replenishing the stack when it is empty and managing stack
overflow are the sole responsibility of the DBM. Specifically, the DBM
sets up and initializes most of the data structures required in the
shared segments and the semaphore variables used in dealing with
critical sections, and processes database checkpoints. The ATPs are
restricted to moving data in and out of the database according to
requests by APs. The binding of an AP's view and its server ATP occurs
at the database view opening time. The benefit is, again, in simplicity
of design, for no dynamic server scheduler is needed in offering
multiple ATPs.

Fig. 2: DBAS process.

2.3 Data-manipulation primitives 

Table II lists a set of data-manipulation primitives for an AP to
interact with database processes. They are implemented as a set of
standard library functions residing in each AP's address space.
Messages sent to and received from the database processes are embedded
in these routines and are therefore transparent to the APs. The
primitives OPENDB and CLOSEDB also connect and disconnect the shared
segment in the AP's address space when necessary. When a view is
opened by an AP, the DBM allocates a system work area in the shared
segment. For each database request, the work done by the corresponding
routine on the AP side includes moving data between the AP's
workspace and a specific system work area in the shared segment, and
sending a message to and waiting for a reply from the database
processes.

Table II: Data manipulation primitives

DBM                    ATP

OPENDB                 RETRIEVE*
CLOSEDB                STORE*
BEGIN_SESSION          DELETE*
END_SESSION            REPLACE
READ_TRACE             LOCKDB
RETRIEVE_BY_DATE       FREEDB

* The amount of data is either a tuple or a relation.

The amount of data passing through the system work area is one
tuple at a time. The format of the tuple, including the ordering and
data types of the fields in the tuple, follows the AP's view. An
entire relation, in a restricted form, can also be retrieved in a single
retrieve command; in this case, the bulk of the data is passed from
the ATP to the requesting AP in a file.

2.4 Shared segments 

The database processes rely heavily on the shared segments sup- 
ported by the real-time UNIX operating system. They are used as 
storage for common data, as means to save core space, and to get 
around the limitation of small virtual address space imposed by the 
machine. A shared segment is also used in moving data between an ap 
and database processes. 

Segments used in DBAS are named (i) AP-segment, (ii) ATP-segment,
and (iii) buffer-segment. The AP-segment is shared among the DBM,
all ATPs, and all APs accessing the database. For each view opened by
an AP, a unique system work area in the AP-segment is allocated for
the purpose of moving data between the AP and ATPs. The ATP-segment
is internal to the database processes and not shared with the APs. Data
structures shared internally among the database processes include,
among others, the free-page address stacks, the top level of the
database tree, and concurrency control structures. Most structures on
the AP-segment and ATP-segment are dynamically allocated and freed.
Routines of the UNIX operating system are modified for the purpose
of managing the individual segment space.

Buffer caches are implemented through buffer-segments. A connected
segment always occupies the same address space within an ATP
so that address pointers within a segment are always meaningful. An
ATP can connect to only one buffer-segment. There are more
buffer-segments than there are ATPs. A buffer-segment, disconnected
from an ATP and saved at the end of a retrieval-type database request
of an order update, is most likely reconnected by the ATP to process
the subsequent replacement-type database request from the same AP. This
type of buffer cache eliminates three disk accesses per order update
(i.e., per paired retrieval-type and replacement-type database
requests).

III. DESIGN AND IMPLEMENTATION ISSUES 
3.1 File structure and file access

The dbas application and its throughput requirement impose the 
following constraints on the database file structure design. 

• A file large enough to deal with up to 12 million records.

• Records of variable length. They can be dynamically inserted and
  deleted.

• Access of a random record that requires as few disk accesses as
  possible.

• Facility to retrieve all BNG records and all BNR records of a given
  office code, NPA-NXX.

The reasons that the file and directory structures of the real-time
UNIX operating system cannot meet these needs are discussed in
Section 3.4. Among the other candidates for a suitable file structure,
the B-tree³ and extendible hashing methods⁴,⁵ require the fewest
disk accesses in retrieving or storing a random BNR record out of a
population of 12 million. An order update accesses both a BNR record
and its hierarchically related BNG record. The interesting problem is
how to structure the BNG records efficiently so that they do not cost
additional disk accesses in an order update. The solution used in DBAS
is a two-stage extendible hashing algorithm. The structure of the data
(Fig. 3) is essentially a two-level hierarchy consisting of the BNG
records of an operating telephone company at the first level and BNR
records (of individual billing numbers) at the second level. Extendible
hashing is used at both levels to access the BNG record and the BNR
record, respectively.

Fig. 3: Primary database structure.

3.1.1 Extendible hashing algorithm 

Briefly, given a key, the algorithm first computes its hashed key
value (HKV) and then uses the HKV to search a binary tree,
stored in bit-vector form, to compute a logical page address (LA). The
LA consists of the low-order bits of the HKV selected by a mask; the
size of the mask (number of bits) depends upon the level of the node in
the binary tree. From the LA and a logical address-to-physical address
map (LPAM), the corresponding physical page address (PA) is known. The
disk page at PA is read into the memory buffer, and the records within
the buffer are searched sequentially until the one with the matching
key value is found.
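
The lookup steps above can be sketched as follows. The paper does not
give the bit-vector layout, so this sketch assumes a heap-ordered
binary tree (node 1 is the root, node n has children 2n and 2n+1) in
which a set bit means "this node has been split", and an LPAM indexed
directly by the masked value. All names and the MAX_HEIGHT limit are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HEIGHT 4                        /* hypothetical height limit */
#define NODES      (1u << (MAX_HEIGHT + 1)) /* heap-ordered node slots   */

struct eh_dir {
    unsigned char split[NODES / 8];       /* bit n set: node n is split */
    unsigned long lpam[1u << MAX_HEIGHT]; /* logical -> physical page   */
};

static int is_split(const struct eh_dir *d, unsigned n)
{
    return (d->split[n / 8] >> (n % 8)) & 1;
}

/* Record that node n (an interior position) has been split. */
void eh_mark_split(struct eh_dir *d, unsigned n)
{
    assert(n >= 1 && n < NODES / 2);
    d->split[n / 8] |= (unsigned char)(1u << (n % 8));
}

/* Walk from the root, consuming one low-order HKV bit per level,
 * until an unsplit node (a leaf) is reached. The logical address is
 * the HKV masked to as many low-order bits as the leaf is deep. */
unsigned eh_logical(const struct eh_dir *d, unsigned long hkv,
                    unsigned *depth_out)
{
    unsigned node = 1, depth = 0;

    while (depth < MAX_HEIGHT && is_split(d, node)) {
        node = 2 * node + (unsigned)((hkv >> depth) & 1);
        depth++;
    }
    if (depth_out != NULL)
        *depth_out = depth;
    return (unsigned)(hkv & ((1ul << depth) - 1));
}

/* Map the logical address to a physical page address via the LPAM. */
unsigned long eh_physical(const struct eh_dir *d, unsigned long hkv)
{
    return d->lpam[eh_logical(d, hkv, NULL)];
}
```

With this layout two different leaves never yield the same masked
value, because a shallower leaf with the same value would be an
ancestor of the deeper one. That property is what lets the sketch index
the LPAM directly; the real LPAM spans several disk pages.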

The process of storing a record follows the same steps as searching
for a record to bring a page into the memory buffer. Room for the new
record is allocated within the unused area of the (page) buffer. The
new record is then moved into its assigned area of the buffer and the
buffer content is written out to disk to complete the insertion
operation. If the page brought into memory does not have enough
room for the new record, a new page is allocated. The records within
the page just brought in and the new record are distributed between the
two pages according to their hashed key values. The two pages are
each assigned a new logical page address according to the hashed key
values of the records they contain. The bit vector, which represents the
relationship of all defined logical page addresses, is updated
accordingly. Similarly, the entries corresponding to the affected
logical page addresses within the LPAM are also updated.
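
The redistribution step in the overflow case can be sketched as below.
For brevity the sketch keeps only hashed key values in fixed-size
slots, whereas DBAS records are of variable length. It assumes that a
page at depth d is split on bit d of the hashed key value; the
structure and function names are hypothetical.

```c
#include <assert.h>

#define PAGE_SLOTS 4  /* hypothetical: fixed record slots per page */

struct page {
    unsigned depth;                /* low-order HKV bits the page covers */
    int count;                     /* records currently on the page      */
    unsigned long hkv[PAGE_SLOTS]; /* hashed key values of the records   */
};

/* Split a page: records whose next HKV bit is 1 move to `fresh`,
 * the rest stay in `old`. Both pages end up one level deeper.
 * Returns the number of records moved. */
int page_split(struct page *old, struct page *fresh)
{
    unsigned bit = old->depth;
    int i, kept = 0;

    assert(old != fresh);
    fresh->depth = bit + 1;
    fresh->count = 0;
    for (i = 0; i < old->count; i++) {
        if ((old->hkv[i] >> bit) & 1)
            fresh->hkv[fresh->count++] = old->hkv[i];
        else
            old->hkv[kept++] = old->hkv[i];
    }
    old->count = kept;
    old->depth = bit + 1;
    return fresh->count;
}

/* New logical page address of a page, taken from any resident
 * record: the low `depth` bits that all of its records share. */
unsigned long page_la(const struct page *p, unsigned long any_hkv)
{
    return any_hkv & ((1ul << p->depth) - 1);
}
```

After the split the caller would place the new record on whichever
page matches its own bit, mark the parent node as split in the bit
vector, and store the two physical page addresses in the LPAM under the
two new logical addresses. If every record lands on the same side, the
split has to be repeated one level deeper.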

3.1.2 Data structures used in extendible hashing

The following types of input data structures are needed for each 
application of the extendible hashing algorithm: 

• A bit vector, its height and size. 

• The LPAM pages. The size of the LPAM varies dynamically. It
can occupy several disk pages. Its size grows and shrinks more
easily if it is not required to occupy consecutive physical page
addresses.

3.2 Concurrency 

Concurrent operations are necessary for the DBAS to achieve its
throughput objectives. At the hardware level, multiple disk controllers
are used to maximize the data-path bandwidth between the disks and
the main memory. At the software level, multiple APs can access the
database at the same time. Multiple copies of the ATP run
simultaneously to serve the APs and minimize CPU idle time.

Locks are used to solve the data-conflict problem. In order to 
simplify the implementation of any locking scheme, locks on logical 
records are assumed to be translated into equivalent locks on the 
corresponding physical pages. The locks are chosen so that (i) they 
are easy to use and to design, and (ii) they provide a high degree of 
concurrency. The following considerations are immediately noted. 

3.2.1 Logical and physical locks

A race condition exists when two different aps attempt to update 
the same data at the same time. The negative consequence of a race 
condition is that the database may no longer be consistent. Data 
conflict may occur at either the logical record level when aps access 
the same bng record (or the same bnr record), or the physical record 
level when aps access different bng records (or bnr records) that 
happen to be located on the same physical page. Since physical pages 
are transparent to the aps, locking on physical pages takes place 
implicitly when an ap explicitly locks a logical record. 

3.2.2 Lock granularity 

Clearly, the smaller the sizes of lock granules are, the higher is the 
degree of concurrency that can be achieved. However, locking of a bnr 
record, the smallest logical record in the dbas database, necessitates 
the implicit locking of several physical pages. There is therefore a 
direct link between the lock granularity and complexity in implement- 
ing it. 

3.2.3 The location of a lock 

The locks can be kept in the main memory or on the disk next to 
where the locked data item is. The shortcomings of keeping them on 
disk include (i) more disk accesses are required in accessing the locks, 
and (ii) the data items locked by an aborted ap may become perma- 
nently inaccessible. 

3.2.4 Deadlock 

When an AP is allowed to hold more than one lock, there is a
possibility of deadlock.⁶ Locks left by aborted APs may also
introduce the deadlock problem.

The DBAS locking scheme is a much simplified version of the
hierarchical locks described in Ref. 7. An AP can lock either the
entire database or a single BNG record for exclusive access. When a BNG
record is locked, the physical page containing the BNG record is
implicitly locked. Because of the size of the database, the probability
of two random BNG records residing on the same page is much less
than one hundredth. Even though the lock granule is coarse, the degree
of concurrency is adequate for DBAS applications. Moreover, the
distinction between logical and physical locking is practically
eliminated from the implementation. An AP is permitted to own at most
one lock at any time so that deadlock due to multiple active APs
waiting for one another is prevented. Locks left behind by aborted APs
are cleared periodically, at about 5-minute intervals, to
avoid causing the system to wait indefinitely.
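
The locking rules in this section can be modelled with a small
in-memory table. The sketch below is illustrative only: it reports a
conflict to the caller instead of queueing the request, it takes AP
liveness as an input because the paper does not say how aborted APs are
detected, and all names and the MAX_APS limit are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_APS   8  /* hypothetical limit on concurrent APs */
#define LOCK_NONE 0
#define LOCK_BNG  1  /* exclusive lock on one BNG record     */
#define LOCK_DB   2  /* exclusive lock on the whole database */

struct ap_lock {
    int  kind;    /* LOCK_NONE, LOCK_BNG or LOCK_DB         */
    char bng[7];  /* six-digit BNG key when kind is LOCK_BNG */
};

struct lock_table { struct ap_lock held[MAX_APS]; };

void lt_init(struct lock_table *t)
{
    memset(t, 0, sizeof *t);
}

/* Returns 0 if granted, -1 if the AP already owns a lock (the
 * one-lock-per-AP rule), -2 if another AP holds a conflicting lock. */
int lt_acquire(struct lock_table *t, int ap, int kind, const char *bng)
{
    int i;

    assert(ap >= 0 && ap < MAX_APS);
    assert(kind == LOCK_DB || (kind == LOCK_BNG && bng != NULL));
    if (t->held[ap].kind != LOCK_NONE)
        return -1;
    for (i = 0; i < MAX_APS; i++) {
        const struct ap_lock *h = &t->held[i];
        if (h->kind == LOCK_NONE)
            continue;
        if (kind == LOCK_DB || h->kind == LOCK_DB)
            return -2;              /* database lock excludes all  */
        if (strncmp(h->bng, bng, 6) == 0)
            return -2;              /* same BNG already locked     */
    }
    t->held[ap].kind = kind;
    if (kind == LOCK_BNG) {
        memcpy(t->held[ap].bng, bng, 6);
        t->held[ap].bng[6] = '\0';
    }
    return 0;
}

void lt_release(struct lock_table *t, int ap)
{
    assert(ap >= 0 && ap < MAX_APS);
    t->held[ap].kind = LOCK_NONE;
}

/* Periodic sweep: drop locks whose owner is no longer alive.
 * Returns the number of locks cleared. */
int lt_sweep(struct lock_table *t, const int alive[MAX_APS])
{
    int i, cleared = 0;

    for (i = 0; i < MAX_APS; i++)
        if (t->held[i].kind != LOCK_NONE && !alive[i]) {
            t->held[i].kind = LOCK_NONE;
            cleared++;
        }
    return cleared;
}
```

Since no AP can wait for a second lock while holding a first, a cycle
of APs waiting on one another cannot form; the sweep handles the
remaining case of a lock whose owner has gone away.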

Data structures for the locks and their corresponding queues are 
kept in the main memory. Specifically, since they have to be accessed 
by all the atps, they are located in the shared ATP-segment. Lock and 
free lock can be either stand-alone database requests or piggy-backed 
to other types of database requests such as retrieve and replace to 
minimize the message overhead in using them for order processing. 

3.3 Database checkpoint 

The dbas database must be reliable. It does not need to be opera- 
tional every minute. However, it should not be down for extended 
periods such that the system cannot clear its backlog of updates. A 
database update operation comprises, in general, several disk writes. 
A database transaction is commonly defined as a sequence of updates 
that transforms the database from one consistent state to another 
consistent state. Because minimizing the number of disk accesses per 
update is the main concern in the dbas design, and adopting database 
transactions to handle updates would have incurred more disk ac- 
cesses, dbas does not have the concept of a database transaction. It 
follows that the database may become inconsistent if the system fails 
in the middle of an update. The most undesirable way to restore the 
consistency of the database is to reload the database. Since it takes 
more than three days to reload an average dbas database, the consist- 
ency of the database must be maintained under the condition of system 
failure so that reloading becomes unnecessary. This is solved by 
performing periodic database checkpoints. At each database check- 
point, a consistent copy of the database is saved on disks. When the 
system is restarted after a failure, the most recent consistent copy, 
saved at the last database checkpoint before the failure, is used, and 
reloading of the database is avoided at the expense of losing the 
updates entered between the last checkpoint and system failure. 

As a further precaution to confine the damage from fatal disk
I/O errors to a small region, we partition the database disks into
read-only disks and writable disks. The read-only disks hold the
consistent copy of the database produced at the end of each day. The
writable disks are empty at the beginning of the day and contain an
increasingly larger portion of the database as the day progresses.
Fig. 4 illustrates the update effects on the writable disks. At the end
of the day, the contents of the writable disks are all merged onto the
read-only disks to give a new consistent copy of the database. The
probability of a hardware write-protected disk drive having a fatal I/O
error is much smaller than that of a drive permitting both read and
write. Consequently, almost all fatal disk I/O errors are confined to
the writable disk drives. The work lost due to a fatal disk I/O error
is therefore limited to one day's updates. A duplex system was
also considered for the DBAS database and rejected because of its cost.

3.4 Secondary storage management 
3.4.1 Introduction 

The real-time UNIX operating system supports both contiguous and
noncontiguous files. The noncontiguous file has the advantage that the
management of free pages is part of the file system function. However,
the number of disk accesses needed to retrieve a page of a large
(noncontiguous) file is too high for the DBAS application. In the case
of contiguous files, the file system uses the concept of multiple
extents to provide the capability of file growth and shrinkage. All
these advantages can be fully utilized if the files are no bigger than
the host file system. The major difficulty lies in the
restriction that the size of a file system cannot exceed the capacity of
a special device (174 million bytes in the case of RP-06). In other
words, a file system in the UNIX operating system does not span more
than one disk drive. The DBAS database needs a file much larger
than one RP-06 disk. This large file would have to be artificially
partitioned across multiple file systems if one insisted on using those
provided by the UNIX operating system. The negative impacts include
(i) an unusually large number of file systems would have to be mounted
at the same time, when mount points are already scarce resources in
most systems, (ii) an unusually large number of files would have to be
introduced to take advantage of the space-management facility of the
UNIX operating system, and (iii) more file-open and file-close
operations would have to take place because of the limit on the number
of files that are allowed to stay open at any time for each process.
These negative impacts would obviously affect the DBAS throughput
objective.

The following considerations are noted in the design of a special 
Secondary Storage Management module (ssm) for dbas use. (See 
Table III for a list of ssm features.) 

3.4.1.1 Transparency of multiple database disks. The SSM should make
the distinction between multiple database disks and a single database
disk transparent to the file-access method. On the other hand, the SSM
should separate the read-only disks from the writable disks so that
data on the read-only disks can take advantage of the hardware
write-protection feature.

Fig. 4: Illustration of the update effects on the writable disk.

3.4.1.2 Efficiency. Allocation and deallocation of a block should 
require almost no disk accesses. Reading (or writing) a block from (or 
to) a disk takes exactly one disk access. Data movement in reading or 
writing a block should also be minimized. 

3.4.1.3 Contiguous disk space. Reading from (or writing to) successive
disk blocks requires fewer disk seeks than reading from (or writing to)
disjoint blocks. In general, the number of I/O calls is the same as the
number of blocks being moved from (or to) the disk. The SSM can provide
a further improvement in efficiency by issuing exactly one I/O call to
move multiple blocks when the source and the destination are known to
occupy contiguous space. Since the DBAS updates are confined to the
writable working disk during the day and merged to the read-only disk
each night, all blocks on the working disk are free in the morning when
the system starts. Moreover, all read-only disk blocks are also,
obviously, free when an initial database load starts. Careful
management of contiguous space cuts down daily update time and the long
period required for database load.

Table III: Secondary storage-management features

Up to 6 RP-06 disks
2K-byte disk blocks
Write 1 to 4 contiguous blocks in 1 system call
Separation of read-only and writable disks
Contiguous free-space management
Fragmented free-space management
Data placement heuristics
Support for database checkpoint
Support for multiple copies of ATPs
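
The saving from contiguous placement described in Section 3.4.1.3 can
be illustrated by grouping block addresses into I/O calls. The sketch
below assumes the block list is already sorted and uses the limit of
four contiguous blocks per system call from Table III. The function and
structure names are hypothetical, and no actual I/O is performed.

```c
#include <assert.h>

#define MAX_RUN 4  /* Table III: 1 to 4 contiguous blocks per call */

struct io_call {
    unsigned long first; /* address of the first block in the call */
    int nblocks;         /* number of contiguous blocks moved      */
};

/* Group sorted, duplicate-free block addresses into as few calls as
 * possible. calls[] must have room for n entries. Returns the number
 * of calls planned. */
int plan_io(const unsigned long *blocks, int n, struct io_call *calls)
{
    int i, ncalls = 0;

    assert(n >= 0);
    for (i = 0; i < n; i++) {
        struct io_call *last = ncalls ? &calls[ncalls - 1] : 0;

        if (last && last->nblocks < MAX_RUN &&
            blocks[i] == last->first + (unsigned long)last->nblocks) {
            last->nblocks++;          /* extends the current run */
        } else {
            calls[ncalls].first = blocks[i];
            calls[ncalls].nblocks = 1;
            ncalls++;                 /* starts a new call       */
        }
    }
    return ncalls;
}
```

Eight blocks at addresses 10 to 14, 20, 22, and 23 need four calls
under this plan instead of eight, which is why keeping free space
contiguous matters most during the initial load and the nightly merge.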

3.4.1.4 Fragmented disk space. The choices of data structures for
managing fragmented disk space include, among others, the following.

(i) A free-page bit vector. The position and the binary value of a
bit represent, respectively, one page and its allocation status.
One bit vector is required for each database disk.

(ii) External free-page address file. A file, external to the database
disk space, is used to record all the free-page addresses.

(iii) Internal free-page address file. A linked list of blocks, where
each block is a part of the database disk space, is used to record all
the free-page addresses.

The first two data structures occupy space that can otherwise be 
allocated for other purposes. The third one, similar to the structure 
used in most cases to manage the main memory free blocks, occupies 
the spaces that are free and unused due to fragmentation. 
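
The third structure can be sketched as a chain of list blocks that
themselves live in free disk blocks. The layout below (a next pointer,
a count, and a few addresses per list block) is an assumption chosen
for illustration; the paper does not describe the on-disk format. The
"disk" here is an in-memory array, and all names are hypothetical.

```c
#include <assert.h>

#define NBLOCKS         64 /* hypothetical tiny disk                  */
#define ADDRS_PER_BLOCK 3  /* addresses per list block (tiny for demo) */

struct fl_block {
    unsigned next;                  /* next list block, 0 = end       */
    int n;                          /* addresses stored in this block */
    unsigned addr[ADDRS_PER_BLOCK]; /* free block addresses           */
};

struct disk {
    struct fl_block blk[NBLOCKS];
    unsigned head;                  /* first list block, 0 = empty    */
};

void fl_init(struct disk *d)
{
    d->head = 0;
}

/* Free block b. If the head list block is full (or absent), the freed
 * block itself becomes the new head, so the list costs no extra space. */
void fl_free(struct disk *d, unsigned b)
{
    assert(b > 0 && b < NBLOCKS);
    if (d->head != 0 && d->blk[d->head].n < ADDRS_PER_BLOCK) {
        struct fl_block *h = &d->blk[d->head];
        h->addr[h->n++] = b;
    } else {
        d->blk[b].next = d->head;
        d->blk[b].n = 0;
        d->head = b;
    }
}

/* Allocate a block; returns 0 when no free block remains. When the
 * head list block holds no addresses, the list block itself is
 * handed out and the head moves to the next list block. */
unsigned fl_alloc(struct disk *d)
{
    struct fl_block *h;
    unsigned b;

    if (d->head == 0)
        return 0;
    h = &d->blk[d->head];
    if (h->n > 0)
        return h->addr[--h->n];
    b = d->head;
    d->head = h->next;
    return b;
}
```

Every block used to hold list entries is itself a free block, so the
bookkeeping consumes only space that fragmentation has already left
unused.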

3.4.1.5 Checkpoint support. When modifying a block, if the old
content is needed for restart after a system crash, then
the old block cannot be overwritten, and a new shadow block must be
used to store the modified content. Even though the contents
of the old block are obsolete, the deallocation of the old block for
reuse must be deferred until the shadow block is checkpointed. The SSM
should manage the deferred free blocks efficiently to support the
database checkpoint.
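
The deferred-free rule can be sketched with two small lists: one of
blocks that may be allocated now and one of old blocks waiting for the
next checkpoint. The capacities, the use of block address 0 as "none",
and all names are assumptions for illustration; the sketch tracks
addresses only and does no disk I/O.

```c
#include <assert.h>

#define MAX_FREE     32  /* hypothetical capacities */
#define MAX_DEFERRED 16

struct ssm {
    unsigned long free_stack[MAX_FREE];   /* allocatable now           */
    int nfree;
    unsigned long deferred[MAX_DEFERRED]; /* old blocks awaiting       */
    int ndeferred;                        /* the next checkpoint       */
};

/* Pop a free block; returns 0 when none is left (block 0 is assumed
 * never to be allocatable). */
unsigned long ssm_alloc(struct ssm *s)
{
    return s->nfree > 0 ? s->free_stack[--s->nfree] : 0;
}

/* Shadow update: choose a new block for the modified content and park
 * the old block on the deferred list. Returns the shadow block
 * address, or 0 if no block or no deferred slot is available. */
unsigned long ssm_shadow(struct ssm *s, unsigned long old_block)
{
    unsigned long shadow;

    assert(old_block != 0);
    if (s->ndeferred == MAX_DEFERRED)
        return 0;
    shadow = ssm_alloc(s);
    if (shadow == 0)
        return 0;
    s->deferred[s->ndeferred++] = old_block;
    return shadow;
}

/* Checkpoint: the shadow copies are now the consistent state, so the
 * old blocks become reusable. Returns the number released. */
int ssm_checkpoint(struct ssm *s)
{
    int released = 0;

    while (s->ndeferred > 0 && s->nfree < MAX_FREE) {
        s->free_stack[s->nfree++] = s->deferred[--s->ndeferred];
        released++;
    }
    return released;
}
```

Until `ssm_checkpoint` runs, an old block can never be returned by
`ssm_alloc`, so a restart from the last checkpoint still finds its
contents intact.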

3.4.1.6 Data placement heuristics. The bng records in the dbas 

database are obviously accessed thousands of times more frequently 
than the bnr records. The bng record should be placed in a special 
disk area so their accesses cause the least amount of head movements. 
Other considerations in the ssm design include 2K (2048) byte block 
size, disk reconfiguration for database growth, and migration. 

3.4.2 Implementation 

The picture of the database disks, presented to the file-access 
method by the ssm, is a large, virtual, contiguous file with multiple 
extents of usable and nonusable space. Each of the multiple extents of 
usable space corresponds to a database disk. The blocks, each having 
2K bytes, of this large virtual file are addressed by 32-bit numbers. 
The first five of the 32 bits are used to address a database disk and the 
remaining bits to address the block offset within the disk. 
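
The address layout can be sketched as follows. The sketch assumes that "the first five bits" means the five most significant bits, which the text does not state explicitly; the names make_addr, addr_disk, and addr_offset are hypothetical.

```c
#include <stdint.h>

#define DISK_BITS   5u
#define OFFSET_BITS (32u - DISK_BITS)                  /* 27 bits of block offset */
#define OFFSET_MASK ((UINT32_C(1) << OFFSET_BITS) - 1u)

typedef uint32_t block_addr;

/* Pack a disk number (0..31) and a block offset into one 32-bit address.
 * Assumption: the disk number occupies the high-order five bits. */
static block_addr make_addr(unsigned disk, uint32_t offset)
{
    return ((uint32_t)disk << OFFSET_BITS) | (offset & OFFSET_MASK);
}

static unsigned addr_disk(block_addr a)
{
    return (unsigned)(a >> OFFSET_BITS);
}

static uint32_t addr_offset(block_addr a)
{
    return a & OFFSET_MASK;
}
```

Under this layout, five bits distinguish up to 32 database disks and the remaining 27 bits address up to 2^27 blocks of 2K bytes on each disk.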

The multiple extents of usable space of this large virtual file are 
further partitioned into three types of areas. 

(i) BNG area. A set of cylinders close to the center of a read-only 
disk is dedicated to the storage of bng-related records. Since the bng 
records are presumed to be accessed at a high traffic rate, the space of 
the remaining areas of the same disk is allocated in a discrete manner 
so that the disk head stays within the bng area with high 
probability. 

(ii) BNR area. With the exception of the bng area, all other areas 
on the read-only disks are used to store bnr records and other 
miscellaneous data. 

(iii) Working volume area. The areas on the writable disks are for 
the purpose of update processing. 

Each area of the three types has contiguous unused space and 
fragmented free space. A contiguous unused space is simply managed 
by keeping its size, and its lower- and upper-bound block addresses in 
the system. The fragmented space of an area type is managed through 
the aforementioned internal free-page address file. The three 
(internal) free-page address files are named GFL, RFL, and WFL for 
the bng, bnr, and working volume area types, respectively. 
Furthermore, so that the allocation and deallocation of the blocks of 
each area type incur no extra disk accesses, a free-page address stack 
is maintained in memory for each area type. Each stack is replenished 
from its free-page address file when it is nearly empty, and overflows 
to that file when it is full. 
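
The interplay between a free-page address stack and its free-page address file can be sketched as below. This is an illustration under stated assumptions, not the dbas implementation: the on-disk file is simulated by an in-memory array, the sizes STACK_CAP, CHUNK, and FILE_CAP are hypothetical, and the stack is refilled only when it is completely empty and spilled only when it is completely full. The counter disk_ops shows that most allocations and deallocations touch only the in-memory stack.

```c
#include <stdint.h>
#include <stddef.h>

#define STACK_CAP 8      /* hypothetical size of the in-memory stack */
#define CHUNK     4      /* addresses moved per spill or refill */
#define FILE_CAP  1024   /* stand-in for the on-disk free-page address file */

typedef struct {
    uint32_t stack[STACK_CAP];
    size_t   top;         /* number of addresses on the stack */
    uint32_t file[FILE_CAP];
    size_t   nfile;       /* number of addresses in the simulated file */
    unsigned disk_ops;    /* simulated accesses to the address file */
} free_list;

static void fl_init(free_list *fl)
{
    fl->top = 0;
    fl->nfile = 0;
    fl->disk_ops = 0;
}

/* Deallocate a block: push its address, spilling CHUNK entries first if full. */
static int fl_release(free_list *fl, uint32_t addr)
{
    if (fl->top == STACK_CAP) {
        if (fl->nfile + CHUNK > FILE_CAP)
            return -1;
        for (int i = 0; i < CHUNK; i++)
            fl->file[fl->nfile++] = fl->stack[--fl->top];
        fl->disk_ops++;
    }
    fl->stack[fl->top++] = addr;
    return 0;
}

/* Allocate a block: pop an address, refilling from the file if the stack is empty. */
static int fl_acquire(free_list *fl, uint32_t *addr)
{
    if (fl->top == 0) {
        size_t n = fl->nfile < CHUNK ? fl->nfile : CHUNK;
        if (n == 0)
            return -1;                     /* no free pages anywhere */
        for (size_t i = 0; i < n; i++)
            fl->stack[fl->top++] = fl->file[--fl->nfile];
        fl->disk_ops++;
    }
    *addr = fl->stack[--fl->top];
    return 0;
}
```

With these sizes, one file access is amortized over CHUNK stack operations; a real system would presumably size the chunk to one 2K-byte block of addresses.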

The management of deferred free pages in supporting database 
checkpoints is similar to that of fragmented free pages of the working 
volume area. A fourth (internal) free-page address file, called DFL, 
and a companion free-page address stack are allocated and operated 
the same way in managing the deferred free pages. 

Finally, when multiple copies of the atp are running, they all need 
to access the ssm data structures. These data structures are all placed 
in the commonly shared ATP-segment and accessed under semaphores 
to avoid critical-section problems. 

3.5 Buffer management 

A set of page buffers is statically allocated within the user address 
space to form a buffer pool. A block of data that is moved into or out 
of the database must first be placed in one of these buffers. The buffers 
are time-shared among the different page types of the database. The 
objectives are to minimize the amount of data movement and to 
facilitate implementation of the extendible hashing algorithm. The 
following considerations are noted in the buffer management design. 

3.5.1 Number of buffers 

The data space of 64K bytes allowed to each process on the PDP 
11/70 sets an implied upper limit on the amount of space that can be 
allocated to buffers. The internal operation of the database requires 
page splitting, which can demand 4 to 6 pages to be buffered simulta- 
neously. 

3.5.2 Contiguity of multiple buffers 

Contiguous free space on disk is used to reduce the number of disk 
I/O operations in updating a record. This requires that at least some 
set of buffers in the pool occupy contiguous memory space. 
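
The benefit of contiguity can be sketched as follows: when several 2K-byte buffers are adjacent in memory and their target blocks are consecutive on disk, one write request can cover the whole run. The sketch uses the C standard I/O library on an ordinary file as a stand-in for the raw disk interface, which is an assumption made for portability; write_run and read_block are hypothetical names.

```c
#include <stdio.h>
#include <stdint.h>

#define BLOCK_SIZE 2048L   /* 2K-byte blocks, as in the text */

/* Write nblocks buffers that are contiguous in memory to nblocks
 * consecutive disk blocks, using one seek and one write request. */
static int write_run(FILE *disk, uint32_t first_block,
                     const unsigned char *bufs, size_t nblocks)
{
    if (fseek(disk, (long)first_block * BLOCK_SIZE, SEEK_SET) != 0)
        return -1;
    return fwrite(bufs, BLOCK_SIZE, nblocks, disk) == nblocks ? 0 : -1;
}

/* Read a single block back, for checking the result. */
static int read_block(FILE *disk, uint32_t block, unsigned char *buf)
{
    if (fseek(disk, (long)block * BLOCK_SIZE, SEEK_SET) != 0)
        return -1;
    return fread(buf, BLOCK_SIZE, 1, disk) == 1 ? 0 : -1;
}
```

If the buffers were scattered in memory, each block would need its own request even when the disk blocks were consecutive, which is why the pool must contain a contiguous set of buffers.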

3.5.3 Buffer cache 

Updating requires two accesses to the same record. A cache buffer 
arrangement is implemented to keep the record in memory between 
these two accesses, while allowing the atp to act on other requests. 
With this arrangement nearly 50 percent of accesses will find the 
record already in memory. This hit ratio is even higher when a large 
number of sequential records are updated, since they will usually go to 
the same page. 

Each buffer has 2K bytes. An atp has a pool of six buffers in its 
address space, four of the six occupying a contiguous 8K shared buffer- 
segment. Buffer-segments are used to provide the buffer cache capa- 
bility. A buffer-segment stack, located on the ATP-segment, is used to 
manage the free buffer-segments. 

To support database checkpoints, the ssm time-stamps each page 
when it is written on disk. When a block is to be copied from a memory 
buffer to a disk, the time stamp decides whether the old disk block can 
or cannot be overwritten. If the time stamp is earlier than the last 
checkpoint time, then it must be saved for the purpose of restart after 
a system crash, and a new disk block, onto which the modified buffer 
contents are copied, must be allocated. Clearly, writing the buffer 
contents to a new disk block may affect pointers in other buffers. The 
accurate processing of this chain reaction is part of the objective of 
the buffer descriptor array. 
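
The time-stamp test can be sketched as below. This is a simplified illustration, not the dbas code: page_ref, deferred_list, place_write, and the bump allocator alloc_block are hypothetical, time is an integer counter, and the propagation of the new address to pointers held in other buffers is left out. A block whose stamp predates the last checkpoint is kept, its address goes on a deferred free list (the role the text assigns to DFL), and the modified contents go to a newly allocated block.

```c
#include <stdint.h>

typedef struct {
    uint32_t addr;    /* disk block that currently holds the page */
    uint32_t stamp;   /* time at which that block was last written */
} page_ref;

#define DFL_CAP 16    /* hypothetical capacity of the deferred free list */

typedef struct {
    uint32_t addr[DFL_CAP];
    unsigned n;
} deferred_list;

/* Hypothetical allocator: hands out fresh block addresses starting at 100. */
static uint32_t next_free_block = 100;
static uint32_t alloc_block(void) { return next_free_block++; }

/* Choose the disk block for a modified page.
 * Returns 1 if a shadow block was allocated, 0 if the old block may be
 * overwritten in place, and -1 if the deferred free list is full. */
static int place_write(page_ref *pg, uint32_t now, uint32_t last_checkpoint,
                       deferred_list *dfl)
{
    int shadowed = 0;

    if (pg->stamp < last_checkpoint) {
        /* The old block belongs to the last checkpoint and is needed for
         * restart, so it is kept and its release is deferred. */
        if (dfl->n == DFL_CAP)
            return -1;
        dfl->addr[dfl->n++] = pg->addr;
        pg->addr = alloc_block();
        shadowed = 1;
    }
    pg->stamp = now;   /* the ssm time-stamps the page when it is written */
    return shadowed;
}
```

Once the next checkpoint completes, the addresses on the deferred list can be moved to the ordinary free list of the working volume area.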

The state of the buffers in the buffer pool is described by a buffer- 
descriptor array. Each entry of this array records whether the buffer 
is free or used, the disk block addresses from which and to which the 
contents of the buffer are copied, and the relationship of the buffer 
contents to other buffers. 

3.6 Secondary key retrieval 

A service order containing a future service-effective date is called a 
pending order. Because of the size of the dbas database, it is impossible 
to scan through the entire database to find all the bnr records with a 
given effective service date. The dbas database management system 
provides a restricted secondary key retrieval capability for the purpose 
of processing and administering pending orders. Given a (future) 
service-effective date, the retrieval-by-date function lists the keys 
(telephone station numbers) of all bnr records that contain the given 
(service-effective) date. 

Since the retrieval-by-date function is not expected to be used very 
often, the main concern is to devise a structure so that its maintenance 
during a regular order update will not incur extra disk accesses. An 
inverted list is chosen for the secondary key retrieval. During normal 
updates, new entries of the inverted list are piled on top of the old 
ones in a dedicated main memory buffer similar to a free-page address 
stack. Whenever the buffer is full, its contents are moved to a disk 
block and linked to the rest of the inverted list. 

When processing a retrieval-by-date request, the inverted list is 
sorted and filtered to produce the results. Obsolete entries of the 
inverted list are removed nightly during database merge time. 
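
The inverted list can be sketched as follows. The sketch rests on several assumptions: disk blocks are simulated by an in-memory array, the entry layout (an integer date and a numeric station key) and the sizes BUF_ENTRIES and MAX_BLOCKS are hypothetical, and the nightly removal of obsolete entries is not modeled. New entries accumulate in a main-memory buffer; a full buffer is copied to a block and linked to the rest of the list; retrieval filters every entry by date and sorts the matching keys.

```c
#include <stdint.h>
#include <stdlib.h>

#define BUF_ENTRIES 4    /* hypothetical, small so that a flush is easy to see */
#define MAX_BLOCKS  32   /* size of the simulated disk area */

typedef struct { uint32_t date; uint64_t key; } inv_entry;

typedef struct {
    inv_entry e[BUF_ENTRIES];
    int       next;               /* index of the next older block, -1 at the end */
} inv_block;

typedef struct {
    inv_entry buf[BUF_ENTRIES];   /* dedicated main-memory buffer */
    int       nbuf;
    inv_block disk[MAX_BLOCKS];   /* simulated disk blocks */
    int       nblocks;
    int       head;               /* newest flushed block, -1 if none */
} inv_list;

static void inv_init(inv_list *l) { l->nbuf = 0; l->nblocks = 0; l->head = -1; }

/* Record (date, key); a disk block is touched only when the buffer is full. */
static int inv_add(inv_list *l, uint32_t date, uint64_t key)
{
    if (l->nbuf == BUF_ENTRIES) {
        inv_block *b;
        if (l->nblocks == MAX_BLOCKS)
            return -1;
        b = &l->disk[l->nblocks];
        for (int i = 0; i < BUF_ENTRIES; i++)
            b->e[i] = l->buf[i];
        b->next = l->head;        /* link to the rest of the inverted list */
        l->head = l->nblocks++;
        l->nbuf = 0;
    }
    l->buf[l->nbuf].date = date;
    l->buf[l->nbuf].key = key;
    l->nbuf++;
    return 0;
}

static int cmp_key(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Retrieval by date: filter all entries, then sort the matching keys. */
static int inv_by_date(const inv_list *l, uint32_t date, uint64_t *out, int max)
{
    int n = 0;
    for (int i = 0; i < l->nbuf && n < max; i++)
        if (l->buf[i].date == date)
            out[n++] = l->buf[i].key;
    for (int b = l->head; b != -1; b = l->disk[b].next)
        for (int i = 0; i < BUF_ENTRIES && n < max; i++)
            if (l->disk[b].e[i].date == date)
                out[n++] = l->disk[b].e[i].key;
    qsort(out, (size_t)n, sizeof out[0], cmp_key);
    return n;
}
```

The design trades a slow retrieval, which must scan the whole list, for an update path that writes a block only once per BUF_ENTRIES insertions.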

3.7 Database load 
3.7.1 Introduction 

The initial loading of the dbas database from magnetic tapes 
prepared by a boc's data center is an expensive, time-consuming process. 
For a large database, it takes four days to load the database from 
scratch to its full size. If regular updates were used to insert one record 
at a time to the database, the loading time would be at least ten times 
longer (e.g. 40 days). The main goal of the dbas database initial-load 
program is to shorten the total database loading time. Features used 
to achieve this goal include (i) a linear depth-first search algorithm 8 to 
avoid repeated writing of the same disk blocks, (ii) taking advantage 
of the contiguous disk space to reduce the number of disk writes, (iii) 
options of running multiple copies of the database initial-load program 
to increase throughput rate, and (iv) a checkpoint to minimize the 
degree of work loss due to system crash. The following linear depth- 
first search algorithm— which stores all bnr records of a given bng 
record in the database — illustrates where contiguous disk space is used 
to reduce the number of system calls. 

3.7.2 Store-all algorithm 

Assume that the records R1,R2,...,Rn are prearranged in ascending 
hashed key values K1 < K2 < ... < Kn. Initially, the binary tree, T, consists 
of only the root. The current buffer, CB, is empty and is assigned the 
logical page address at current level L = 1. 

1. [Iteration]. For i = 1, 2, ..., n do steps 2 to 7. At the end of this 
iteration, a clean-up operation is done to complete the loading. 

2. [Input]. Read record Ri. 

3. [Output current buffer to disk?] Compute the logical page address 
Pi and level Li from Ki and current binary tree T. If Pi is the same as 
the current buffer logical page address, do step 4. Otherwise, the 
content of CB is stabilized. If CB is non-empty, it is written onto a disk. 
A new CB is allocated and initialized, and its logical page address is set 
to Pi. The current level L of CB is set to Li. 

4. [Add record to current buffer]. If the current buffer has room for 
the in-coming record Ri, place Ri on the current buffer and go to step 
2 to process the next record. Otherwise, since the current buffer CB is 
too full for record Ri, do steps 5 to 7. 

5. [Allocate an additional buffer for tree splitting]. Allocate a buffer, 
CB'. 

6. [Grow tree T by splitting contents of CB at level L]. Split the 
contents of the full buffer CB at level L between CB and CB' at level 
L+1. The logical page addresses of CB and CB' at level L+1 are 
LCHILD(Pi, L) and RCHILD(Pi, L). The logical page address of each 
record of the full buffer CB at level L is recomputed; if it agrees with 
the new logical page address of CB at level L+1, the record remains in 
CB. Otherwise it is put in CB'. 

7. [Select the new current buffer]. Compute the new logical page 
address Pi from Ki and the newly split tree T. If Pi agrees with that of 
CB, keep CB as the current buffer. In this case, CB' must be empty, 
and CB is still too full for record Ri. Set L to L+1 and repeat step 6. On 
the other hand, if Pi agrees with that of CB', then the content of CB is 
stabilized. If CB is non-empty, it is written out onto disk. Let the new 
current buffer CB be CB', set L to L+1 and repeat step 4. 

Note 1. By writing out a buffer's contents immediately after they become 
stabilized, the description of the algorithm is simplified. In reality, 
buffer contents are not written out until no more contiguous buffers 
can be allocated. Contents of contiguous buffers are moved to 
consecutive pages on disk in a single system call. This not only reduces 
the number of seeks from (up to) 4 to 1, but also economizes the CPU 
usage during initial load. 

Note 2. The cleanup operation moves what remains in the buffer 
pool to disk. These buffers include the contiguous ones that contain 
records not yet stored on disk, buffers used to build the 
logical-to-physical address translation, and the buffer containing the bng record 
with the newly constructed binary tree in bit-vector form. 

Note 3. The left child node has logical page address LCHILD(Pi, L) 
= Pi. The right child node has the logical page address RCHILD(Pi,L) 
= 2**L + Pi. 
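
Step 6 and Note 3 can be sketched together as below. The child-address formulas are those of Note 3. How a record's hashed key selects the left or right child is not spelled out in this section, so the sketch assumes that bit L of the hashed key decides (a set bit sends the record to the right child, whose address gains 2**L); that rule, and the names child_of and split_buffer, are assumptions made for illustration. Buffers are reduced to arrays of hashed keys.

```c
#include <stdint.h>
#include <stddef.h>

/* Note 3: the left child keeps the parent's logical page address;
 * the right child adds 2**L. The level must be less than 32 here. */
static uint32_t lchild(uint32_t p, unsigned level) { (void)level; return p; }
static uint32_t rchild(uint32_t p, unsigned level)
{
    return (UINT32_C(1) << level) + p;
}

/* Assumption: bit `level` of the hashed key selects the child. */
static uint32_t child_of(uint32_t key, uint32_t p, unsigned level)
{
    return ((key >> level) & 1u) ? rchild(p, level) : lchild(p, level);
}

/* Step 6: split the keys of the full buffer cb (page p, level `level`)
 * between cb, which becomes the left child, and cb2, the right child.
 * cb2 must have room for *ncb keys. */
static void split_buffer(uint32_t *cb, size_t *ncb,
                         uint32_t *cb2, size_t *ncb2,
                         uint32_t p, unsigned level)
{
    size_t keep = 0;

    *ncb2 = 0;
    for (size_t i = 0; i < *ncb; i++) {
        if (child_of(cb[i], p, level) == lchild(p, level))
            cb[keep++] = cb[i];          /* record remains in CB */
        else
            cb2[(*ncb2)++] = cb[i];      /* record moves to CB' */
    }
    *ncb = keep;
}
```

If every key lands in the same child, the other buffer stays empty and the caller must split again one level deeper, which is the case step 7 handles by repeating step 6.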

IV. EXPERIENCES 

dbas, being a vital link between a boc and the bvas at network 
control points in supporting the nationwide mechanized calling card 
service, is a production database management system. Most bocs in 
the Bell System have committed themselves to installing dbas by the 
second quarter of 1982. The turnover of the dbas Generic 2DB3 to its 
first customer, Southwestern Bell in St. Louis, Missouri, was right on 
schedule in June 1981. An earlier Generic, 2DB2, was also delivered on 
schedule to New York Telephone in July 1980. The data collected 
so far from the field show that the design has very successfully met its 
capacity and throughput rate objectives. 

The following are some of the contributing factors towards the 
success of the project. 

(i) The dbas design team is throughput conscious and goal ori- 
ented. The time and coding complexities of each component have been 
closely monitored throughout the design and implementation stages. 
(ii) The dbas database-management modules comprise approximately 
30 thousand lines of C language code. Their modular and 
layered structure has made the debugging and trouble-shooting tasks 
manageable. 

(iii) The interaction among the application programs is minimized 
through multiple user views (supported by dbas). The programmers, 
developing the aps, do not have to go through cumbersome and error- 
prone tasks in negotiating a common data header among themselves 
in the process of designing and debugging their individual programs. 
The benefit is also apparent in the shorter overall system testing and 
integration time. 

(iv) The evolutionary approach, getting the essential programs to 
work first and partitioning the entire job into two stages (Generics 
2DB2 and 2DB3), has boosted the confidence and morale of the 
developers and other members of the dbas project in 
delivering the products on schedule. 